[Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning](https://arxiv.org/abs/2306.08400)
Background Notes
meta-RL vs RL
When an ML algorithm’s name is prefixed with “meta,” the algorithm is about learning to learn. Think back to MAML (“Model-Agnostic Meta-Learning”), with its inner and outer learning loops and their fast and slow learning rates, respectively. In an RL context, that means the agent is expected to learn how to solve a family of related tasks in the abstract, so that when presented with a specific task it can quickly (learn to) solve it. In the paper, the family of tasks is finding the office in a building (any building the authors could draw for the agent to navigate); a particular task is one particular building; a trial is a single exploration + evaluation in a building.
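To make the two-loop structure concrete, here is a minimal toy sketch in NumPy. It uses the simpler first-order Reptile update rather than MAML’s second-order one (MAML differentiates through the inner loop), but the fast/slow loop shape is the same. The task family (fitting a line with a per-task slope), the slope distribution, and all hyperparameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # A family of related tasks: fit y = a * x, with the slope a drawn
    # per-task from a narrow distribution around 1.5.
    a = rng.normal(1.5, 0.1)
    x = rng.uniform(-1, 1, size=16)
    return x, a * x, a

def grad(w, x, y):
    # Gradient of mean squared error for the one-parameter model w * x.
    return 2.0 * np.mean((w * x - y) * x)

w = 0.0                       # meta-parameters ("slow" weights)
inner_lr, outer_lr = 0.1, 0.05

for _ in range(2000):         # outer loop: slow learning across tasks
    x, y, _a = sample_task()
    w_fast = w
    for _ in range(5):        # inner loop: fast adaptation to one task
        w_fast -= inner_lr * grad(w_fast, x, y)
    # Reptile-style first-order update: nudge the slow weights toward
    # the task-adapted fast weights.
    w += outer_lr * (w_fast - w)

# After meta-training, w sits near the task family's mean slope, so a
# handful of inner steps suffices to fit any new task from the family.
```

The point of the outer loop is that it leaves `w` at an initialization from which the inner loop converges quickly on any task in the family, which is the “quickly (learn to) solve it” part above.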
DREAM Algorithm
https://ezliu.github.io/dream/, https://arxiv.org/abs/2008.02790
“The goal of meta-reinforcement learning (meta-RL) is to build agents that can quickly learn new tasks by leveraging prior experience on related tasks. Learning a new task often requires both exploring to gather task-relevant information and exploiting this information to solve the task. In principle, optimal exploration and exploitation can be learned end-to-end by simply maximizing task performance. However, such meta-RL approaches struggle with local optima due to a chicken-and-egg problem: learning to explore requires good exploitation to gauge the exploration’s utility, but learning to exploit requires information gathered via exploration. Optimizing separate objectives for exploration and exploitation can avoid this problem, but prior meta-RL exploration objectives yield suboptimal policies that gather information irrelevant to the task. We alleviate both concerns by constructing an exploitation objective that automatically identifies task-relevant information and an exploration objective to recover only this information. This avoids local optima in end-to-end training, without sacrificing optimal exploration. Empirically, DREAM substantially outperforms existing approaches on complex meta-RL problems, such as sparse-reward 3D visual navigation.”
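The core idea of a separate exploration objective can be caricatured in a tiny toy: reward exploration by how well a learned decoder can recover the task from what exploration observed, so the policy learns to gather task-relevant information and ignore noise. This is only a schematic of that idea, not the paper’s actual algorithm; the two-sensor setup, the tabular decoder, and the REINFORCE update are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: the hidden task descriptor mu is one bit. Exploitation would
# need z = mu to act well. During exploration the agent reads one of two
# "sensors": sensor 0 reveals mu, sensor 1 reveals an irrelevant random bit.

n_actions = 2
prefs = np.zeros(n_actions)        # exploration policy (softmax preferences)
decoder = np.full(n_actions, 0.5)  # decoder[a]: est. P(sensor a's reading == mu)

for _ in range(3000):
    mu = rng.integers(2)                       # sample a task
    p = np.exp(prefs) / np.exp(prefs).sum()
    a = rng.choice(n_actions, p=p)             # exploration action
    obs = mu if a == 0 else rng.integers(2)    # sensor reading
    # Update the decoder's accuracy estimate for this sensor.
    decoder[a] += 0.05 * ((obs == mu) - decoder[a])
    # Exploration reward: log-likelihood of the true task under the decoder,
    # i.e. how much information the observation carries about the task.
    q = decoder[a] if obs == mu else 1 - decoder[a]
    r = np.log(max(q, 1e-6))
    # REINFORCE update on the softmax preferences.
    prefs += 0.1 * r * (np.eye(n_actions)[a] - p)

# The policy learns to prefer sensor 0: reading the task-relevant sensor is
# the only way to make the task recoverable from the exploration trajectory.
```

The contrast with end-to-end training is that the exploration reward here never touches downstream task reward at all, which is what sidesteps the chicken-and-egg problem the abstract describes.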
“Indirectly learning language can be desirable as it is directly tied to real-world observations, which yields language understanding that is grounded and consistent with reality. In contrast, standard, directly-trained models can output linguistically correct, but factually incorrect sentences (Lin et al., 2021b).”
To paraphrase: the authors think a model that learns language as a tool for completing unrelated tasks will be less likely to hallucinate than a language model trained directly on text, because the former’s language use is “grounded” in real-world observations.
“rather than rely on explicit language supervision like these studies or tasks requiring language skills, we study language learning in tasks without direct language supervision, which can be solved without any language.”
Those pictorial floor plans are practically unreadable: the “blue” square isn’t blue, and the legend doesn’t help.
“2st”? Presumably a typo for “2nd” in the paper.